NSF PAR Search | NSF Public Access Repository

ReDel: A Toolkit for LLM-Powered Recursive Multi-Agent Systems

https://doi.org/10.18653/v1/2024.emnlp-demo.17

Zhu, Andrew; Dugan, Liam; Callison-Burch, Chris (January 2024, Association for Computational Linguistics)

Recently, there has been increasing interest in using Large Language Models (LLMs) to construct complex multi-agent systems to perform tasks such as compiling literature reviews, drafting consumer reports, and planning vacations. Many tools and libraries exist for helping create such systems; however, none support *recursive* multi-agent systems, where the models themselves flexibly decide when to delegate tasks and how to organize their delegation structure. In this work, we introduce ReDel: a toolkit for recursive multi-agent systems that supports custom tool-use, delegation schemes, event-based logging, and interactive replay in an easy-to-use web interface. We show that, using ReDel, we are able to achieve significant performance gains on agentic benchmarks and easily identify potential areas for improvement through the visualization and debugging tools. Our code, documentation, and PyPI package are open-source at https://github.com/zhudotexe/redel, and free to use under the MIT license.
Full Text Available
MiRAGeNews: Multimodal Realistic AI-Generated News Detection

https://doi.org/10.18653/v1/2024.findings-emnlp.959

Huang, Runsheng; Dugan, Liam; Yang, Yue; Callison-Burch, Chris (January 2024, Association for Computational Linguistics)

The proliferation of inflammatory or misleading “fake” news content has become increasingly common in recent years. Simultaneously, it has become easier than ever to use AI tools to generate photorealistic images depicting any scene imaginable. Combining these two—AI-generated fake news content—is particularly potent and dangerous. To combat the spread of AI-generated fake news, we propose the MiRAGeNews Dataset, a dataset of 12,500 high-quality real and AI-generated image-caption pairs from state-of-the-art generators. We find that our dataset poses a significant challenge to humans (60% F-1) and state-of-the-art multi-modal LLMs (< 24% F-1). Using our dataset we train a multi-modal detector (MiRAGe) that improves by +5.1% F-1 over state-of-the-art baselines on image-caption pairs from out-of-domain image generators and news publishers. We release our code and data to aid future work on detecting AI-generated content.
Full Text Available
RAID: A Shared Benchmark for Robust Evaluation of Machine-Generated Text Detectors

https://doi.org/10.18653/v1/2024.acl-long.674

Dugan, Liam; Hwang, Alyssa; Trhlík, Filip; Zhu, Andrew; Ludan, Josh Magnus; Xu, Hainiu; Ippolito, Daphne; Callison-Burch, Chris (January 2024, Association for Computational Linguistics)

Many commercial and open-source models claim to detect machine-generated text with extremely high accuracy (99% or more). However, very few of these detectors are evaluated on shared benchmark datasets and even when they are, the datasets used for evaluation are insufficiently challenging—lacking variations in sampling strategy, adversarial attacks, and open-source generative models. In this work we present RAID: the largest and most challenging benchmark dataset for machine-generated text detection. RAID includes over 6 million generations spanning 11 models, 8 domains, 11 adversarial attacks and 4 decoding strategies. Using RAID, we evaluate the out-of-domain and adversarial robustness of 8 open- and 4 closed-source detectors and find that current detectors are easily fooled by adversarial attacks, variations in sampling strategies, repetition penalties, and unseen generative models. We release our data along with a leaderboard to encourage future research.
Full Text Available
Exploring the Curious Case of Code Prompts

https://doi.org/10.18653/v1/2023.nlrse-1.2

Zhang, Li; Dugan, Liam; Xu, Hainiu; Callison-Burch, Chris (June 2023, Proceedings of the 1st Workshop on Natural Language Reasoning and Structured Explanations (NLRSE))

Recent work has shown that prompting language models with code-like representations of natural language leads to performance improvements on structured reasoning tasks. However, such tasks comprise only a small subset of all natural language tasks. In our work, we seek to answer whether or not code-prompting is the preferred way of interacting with language models in general. We compare code and text prompts across three popular GPT models (davinci, code-davinci-002, and text-davinci-002) on a broader selection of tasks (e.g., QA, sentiment, summarization) and find that with few exceptions, code prompts do not consistently outperform text prompts. Furthermore, we show that the style of code prompt has a large effect on performance for some (but not all) tasks and that fine-tuning on text instructions leads to better relative performance of code prompts.
Full Text Available
Kani: A Lightweight and Highly Hackable Framework for Building Language Model Applications

https://doi.org/10.18653/v1/2023.nlposs-1.8

Zhu, Andrew; Dugan, Liam; Hwang, Alyssa; Callison-Burch, Chris (January 2023, Empirical Methods in Natural Language Processing)

Language model applications are becoming increasingly popular and complex, often including features like tool usage and retrieval augmentation. However, existing frameworks for such applications are often opinionated, deciding for developers how their prompts ought to be formatted and imposing limitations on customizability and reproducibility. To solve this we present Kani: a lightweight, flexible, and model-agnostic open-source framework for building language model applications. Kani helps developers implement a variety of complex features by supporting the core building blocks of chat interaction: model interfacing, chat management, and robust function calling. All Kani core functions are easily overridable and well documented to empower developers to customize functionality for their own needs. Kani thus serves as a useful tool for researchers, hobbyists, and industry professionals alike to accelerate their development while retaining interoperability and fine-grained control.
Full Text Available
Real or Fake Text?: Investigating Human Ability to Detect Boundaries Between Human-Written and Machine-Generated Text

Dugan, Liam; Ippolito, Daphne; Kirubarajan, Arun; Shi, Sherry; Callison-Burch, Chris (February 2023, The 37th AAAI Conference on Artificial Intelligence)

As text generated by large language models proliferates, it becomes vital to understand how humans engage with such text, and whether or not they are able to detect when the text they are reading did not originate with a human writer. Prior work on human detection of generated text focuses on the case where an entire passage is either human-written or machine-generated. In this paper, we study a more realistic setting where text begins as human-written and transitions to being generated by state-of-the-art neural language models. We show that, while annotators often struggle at this task, there is substantial variance in annotator skill and that given proper incentives, annotators can improve at this task over time. Furthermore, we conduct a detailed comparison study and analyze how a variety of variables (model size, decoding strategy, fine-tuning, prompt genre, etc.) affect human detection performance. Finally, we collect error annotations from our participants and use them to show that certain textual genres influence models to make different types of errors and that certain sentence-level features correlate highly with annotator selection. We release the RoFT dataset: a collection of over 21,000 human annotations paired with error classifications to encourage future work in human detection and evaluation of generated text.
Full Text Available
Enhancing Human Summaries for Question-Answer Generation in Education

https://doi.org/10.18653/v1/2023.bea-1.9

Gonzalez, Hannah; Dugan, Liam; Miltsakaki, Eleni; Cui, Zhiqi; Ren, Jiaxuan; Li, Bryan; Upadhyay, Shriyash; Ginsberg, Etan; Callison-Burch, Chris (July 2023, Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023))

We address the problem of generating high-quality question-answer pairs for educational materials. Previous work on this problem showed that using summaries as input improves the quality of question generation (QG) over original textbook text and that human-written summaries result in higher quality QG than automatic summaries. In this paper, a) we show that advances in Large Language Models (LLMs) are not yet sufficient to generate quality summaries for QG and b) we introduce a new methodology for enhancing bullet point student notes into fully fledged summaries and find that our methodology yields higher quality QG. We conducted a large-scale human annotation study of generated question-answer pairs for the evaluation of our methodology. In order to aid in future research, we release a new dataset of 9.2K human annotations of generated questions.
Full Text Available
Watching Paint Dry: Organic Vapor Emissions from Architectural Coatings and their Impact on Secondary Organic Aerosol Formation

https://doi.org/10.1021/acs.est.2c02478

Tanzer-Gruener, Rebecca; Rajan, Pavithra Ethi; Dugan, Liam D.; Bier, Mark E.; Robinson, Allen L.; Presto, Albert A. (August 2022, Environmental Science & Technology)

Full Text Available
A Feasibility Study of Answer-Unaware Question Generation for Education

https://doi.org/10.18653/v1/2022.findings-acl.151

Dugan, Liam; Miltsakaki, Eleni; Upadhyay, Shriyash; Ginsberg, Etan; Gonzalez, Hannah; Choi, DaHyeon; Yuan, Chuning; Callison-Burch, Chris (January 2022, Findings of the Association for Computational Linguistics: ACL 2022)

We conduct a feasibility study into the applicability of answer-agnostic question generation models to textbook passages. We show that a significant portion of errors in such systems arise from asking irrelevant or un-interpretable questions and that such errors can be ameliorated by providing summarized input. We find that giving these models human-written summaries instead of the original text results in a significant increase in acceptability of generated questions (33% → 83%) as determined by expert annotators. We also find that, in the absence of human-written summaries, automatic summarization can serve as a good middle ground.
Full Text Available
